By Thibault D. 03/03/2019
credit: https://www.uplacevet.com/
As consultants specialized in helping new businesses get started in the Toronto area, we are often required to assess the current state of the market in order to provide valuable advice to our clients. In this case, we were contacted to perform a survey of the veterinary market in the city of Toronto. Our client wants us to recommend a good candidate neighborhood in which to open their veterinary clinic.
Our analysis will be divided into the following phases:
In this section, we load the libraries that we will be using during this project.
# data structures
print('Loading....')
import numpy as np
print('\tnumpy:\t\t',np.__version__)
import pandas as pd
print('\tpandas:\t\t',pd.__version__)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# plot libraries
import matplotlib
print('\tmatplotlib:\t',matplotlib.__version__)
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import seaborn as sns
print('\tseaborn:\t',sns.__version__)
plt.style.use('ggplot')
# geolocalization
import folium
print('\tfolium:\t\t',folium.__version__)
import geocoder
print('\tgeocoder:\t',geocoder.__version__)
from geopy.geocoders import Nominatim
# url fetch
import requests
print('\trequests:\t',requests.__version__)
from pandas.io.json import json_normalize
# scraping
import bs4
from bs4 import BeautifulSoup
print('\tbs4:\t\t',bs4.__version__)
# python
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# sklearn
import sklearn
print('\tsklearn:\t',sklearn.__version__)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
The census data is retrieved from the database made available by the Government of Canada.
We use the census data from 2016, loaded as a *.csv file.
https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/comprehensive.cfm
# URL to CSV from the government of Canada: FSA census of 2016
print('Loading census data...')
URL_census = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Tables/CompFile.cfm?Lang=Eng&T=1201&OFT=FULLCSV'
print('...Census data loaded')
We inspect the head of the dataframe in order to determine what features are contained in the dataset.
df_census_raw = pd.read_csv(URL_census)
df_census_raw.head()
We inspect the properties of the dataset using the .info() function and the .shape attribute.
df_census_raw.info()
The next step is to look at missing and null values. We count how many values are null for each feature.
df_census_raw.isna().sum(axis=0)
print("The census data contains {} rows and {} columns.".format(df_census_raw.shape[0],df_census_raw.shape[1]))
We remove the columns that are not used in this study.
Note: we only keep the geographic code, the population count, and the occupied dwelling count from the census data.
df_census_raw = df_census_raw[['Geographic code','Population, 2016','Private dwellings occupied by usual residents, 2016']]
df_census_raw.head()
The city of Toronto website contains the cat and dog registry for the years 2013 through 2017.
https://www.toronto.ca/ext/open_data/catalog/data_set_files/2017_dog_and_cat_licence_FSA.xls
https://www.toronto.ca/ext/open_data/catalog/data_set_files/2013%20licences%20sold%20by%20fsa.xls
# URL of the XLS files
URL_2017 = 'https://www.toronto.ca/ext/open_data/catalog/data_set_files/2017_dog_and_cat_licence_FSA.xls'
URL_2013 = 'https://www.toronto.ca/ext/open_data/catalog/data_set_files/2013%20licences%20sold%20by%20fsa.xls'
URL_animal = [URL_2013,URL_2017]
# load all the XLS files
list_df = []
list_skiprows = [3,2,2,2,2]
list_skip_footer = [1,1,0,0,1]
# create list of DataFrame
print('Loading animal registration data...')
for idx, url in enumerate(URL_animal):
    df = pd.read_excel(url, sheet_name='Sheet1', skiprows=list_skiprows[idx], skipfooter=list_skip_footer[idx])
    df.columns = ['FSA', 'CAT', 'DOG', 'TOTAL']
    df['Year'] = idx*4 + 2013
    list_df.append(df)
print('...Animal registration data loaded.')
Let's print the shape of the dataframe and explore the nature of its features.
# combine the dataFrames
df_animals = list_df[0].merge(list_df[1],on='FSA',suffixes=('_2013','_2017'))
print("The animal registry data contains {} rows and {} columns.".format(df_animals.shape[0],df_animals.shape[1]))
print('Dataset info:')
print(' ')
df_animals.info()
df_animals = df_animals.drop(['Year_2013','Year_2017'],axis=1)
df_animals.head()
Note: We keep the data as is for now. We will combine the number of animals per neighborhood later.
The Toronto neighborhoods are retrieved from the Wikipedia page:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
In this section, we use the BeautifulSoup library to extract the table containing the list of neighborhoods in Toronto. The following steps are followed:
Step 1: Create a soup object that contains the webpage data.
# url to be scraped
URL_toronto = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# GET request
request = requests.get(URL_toronto)
data = request.text
# convert request to soup
soup = BeautifulSoup(data, "lxml")
print('BeautifulSoup created...')
Step 2: Retrieve the subset of HTML code which contains the table data.
# extract the table
match = soup.find('table',class_='wikitable sortable')
print('Table found...')
Step 3: Extract the headers from the table.
# create new dataframe used to store the table data
df_toronto_postal = pd.DataFrame(columns=['PostalCode', 'Borough', 'Neighborhood'])
print('DataFrame instantiated...')
Step 4: Extract the content of the table.
# fetch table rows tr
data_rows = soup.find('table',class_='wikitable sortable').find('tbody').find_all('tr')
# fetch table cells td
for data_row in data_rows:
    data_split = data_row.find_all('td')
    if len(data_split) > 0:
        postcode = data_split[0].text.strip()
        borough = data_split[1].text.strip()
        neighborhood = data_split[2].text.strip()
        df_toronto_postal = df_toronto_postal.append({'PostalCode': postcode,
                                                      'Borough': borough,
                                                      'Neighborhood': neighborhood}, ignore_index=True)
print('DataFrame populated...')
print("The neighborhood database contains {} rows and {} columns.".format(df_toronto_postal.shape[0],df_toronto_postal.shape[1]))
Let's print the shape of the dataframe and explore the nature of its features.
print('Dataset info:')
print(' ')
df_toronto_postal.info()
df_toronto_postal.head()
In this section, the data is processed and invalid data is eliminated. The following steps are applied:
Step 1: Delete rows where the Borough is defined as "Not assigned"
# Step 1: Delete invalid input
df_toronto_postal = df_toronto_postal[df_toronto_postal['Borough']!='Not assigned'].reset_index(drop=True)
print('Unassigned neighborhoods have been removed...')
Step 2: Concatenate neighborhoods with the same PostalCode
# Step 2: group by PostalCode and Borough, then concatenate the Neighborhoods.
df_toronto_postal = df_toronto_postal.groupby(['PostalCode','Borough'])['Neighborhood'].apply(lambda x: ', '.join(x)).reset_index()
print('Neighborhoods sharing same postal code have been merged...')
We verify that there are no longer any duplicates in the 'PostalCode' column.
print('Checking for duplicates...')
print('Are there PostalCode duplicates?', df_toronto_postal['PostalCode'].value_counts().max() > 1)
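For reference, the same check can be expressed with pandas' duplicated(), the standard idiom for spotting repeats; a minimal sketch on a toy frame (illustrative values only):

```python
import pandas as pd

# toy postal-code column containing one repeat (illustrative values)
df = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1B']})

# duplicated() flags every occurrence after the first
has_dupes = df['PostalCode'].duplicated().any()
print('Are there PostalCode duplicates?', has_dupes)  # True

# after grouping by PostalCode, each code appears exactly once
grouped = df.groupby('PostalCode').size().reset_index(name='count')
print('Are there PostalCode duplicates?', grouped['PostalCode'].duplicated().any())  # False
```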
Step 3: Replace unassigned Neighborhood by the Borough name
# display record without Neighborhood
print('Locations with a borough name but no neighborhood name:')
df_toronto_postal[df_toronto_postal['Neighborhood'].str.contains('Not assigned')]
Only one record contains an unassigned Neighborhood name. We replace the missing neighborhood name with the name of the Borough.
df_toronto_postal.loc[df_toronto_postal['Neighborhood'].str.contains('Not assigned'),'Neighborhood'] = df_toronto_postal.loc[df_toronto_postal['Neighborhood'].str.contains('Not assigned'),'Borough']
print('Neighborhoods without an assigned name have been replaced by their borough name...')
We verify that the data is now cleaned:
print('Checking for unassigned Neighborhood...')
print('Are there unassigned neighborhoods?', df_toronto_postal['Neighborhood'].str.contains('Not assigned').any())
Step 4: Verification
print("There are {} records in the DataFrame".format(df_toronto_postal.shape[0]))
print("The shape of the DataFrame is:")
print(df_toronto_postal.shape)
df_toronto_postal.head()
We need to verify that we have population data for all the postal codes stored in the df_toronto_postal dataframe. Since the census data encompasses more than just the Toronto area, it needs to be filtered down.
print("There are {} postal codes in the df_toronto_postal DataFrame.".format(df_toronto_postal.shape[0]))
print("There are {} postal codes in the df_census_raw DataFrame.".format(df_census_raw.shape[0]))
print("Out of {} neighborhood in the df_toronto_postal DF, {} are also listed in the census dataset.".format(df_toronto_postal.shape[0],
df_census_raw[df_census_raw['Geographic code'].isin(df_toronto_postal['PostalCode'])].shape[0]))
List the neighborhoods that do not appear in the census data.
df_toronto_postal[~df_toronto_postal['PostalCode'].isin(df_census_raw['Geographic code'])]
After some research on Google Maps and Wikipedia, it appears that this is a very small neighborhood (one city block). Therefore, we remove it from the dataset.
df_toronto_postal = df_toronto_postal[df_toronto_postal['PostalCode'].isin(df_census_raw['Geographic code'])]
First, we filter the records to keep only the neighborhoods also present in the df_census_raw dataframe.
df_census_raw = df_census_raw[df_census_raw['Geographic code'].isin(df_toronto_postal['PostalCode'])]
print("There are {} neighborhoods in the census dataframe.".format(df_census_raw.shape))
print("There are {} neighborhoods in the postal code dataframe.".format(df_toronto_postal.shape))
print("There are {} unique postal codes in the census data frame.".format(len(df_census_raw['Geographic code'].unique())))
print("There are {} unique postal codes in the postal code data frame.".format(len(df_toronto_postal['PostalCode'].unique())))
print('Census neighborhood not in the postal dataframe:')
df_census_raw[~df_census_raw['Geographic code'].isin(df_toronto_postal['PostalCode'])]
print('Postal records not in the census dataframe:')
df_toronto_postal[~df_toronto_postal['PostalCode'].isin(df_census_raw['Geographic code'])]
Note: We inspect the population of the neighborhoods to determine if the data contains outliers.
df_census_raw.describe()
Note: Several of the neighborhoods have a very small population (below 15). We inspect these records.
df_census_raw[(df_census_raw['Population, 2016']<15) | (df_census_raw['Private dwellings occupied by usual residents, 2016']<15)]
Note: We delete these records.
df_census_raw = df_census_raw[(df_census_raw['Population, 2016']>15) & (df_census_raw['Private dwellings occupied by usual residents, 2016']>15)].reset_index(drop=True)
Note: We inspect the data to validate its content.
df_census_raw.sort_values(by='Population, 2016',ascending=True).head(15)
Note: The data now makes sense.
Since we have data dated from 2013 to 2017, we need to confirm that we have the number of registered pets for each year and for each neighborhood.
# original shape
print("The animal registry data contains {} rows and {} columns.".format(df_animals.shape[0],df_animals.shape[1]))
We filter the animal registry records to only keep the ones for which the postal code is contained in the postal code and the census data frames.
# filter neighborhoods
df_animals = df_animals[df_animals['FSA'].isin(df_census_raw['Geographic code'])]
print('After match with the census dataset:')
print("The animal registry data contains {} rows and {} columns.".format(df_animals.shape[0],df_animals.shape[1]))
Note: We have data for each year and each neighborhood. We now group the data by summing each feature per neighborhood.
Let's plot the total number of registered pets during each year.
df_sum_pets = df_animals.sum(axis=0).to_frame().T.drop(['FSA','TOTAL_2017','TOTAL_2013'],axis=1).T
df_sum_pets['Year'] = df_sum_pets.index.str[-4:]
df_sum_pets.columns=['Sum','Year']
df_sum_pets = df_sum_pets.reset_index()
df_sum_pets['index'] = df_sum_pets['index'].str[0:-5]
df_sum_pets = pd.pivot_table(df_sum_pets,values='Sum',index='Year',columns='index',aggfunc=np.sum)
df_sum_pets
df_sum_pets.plot(kind='bar',
figsize=(15,6),
title='Number of registered cats and dogs in Toronto');
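The wide-to-long reshaping above can also be written with pandas' melt, which may be easier to follow; a minimal sketch on toy yearly totals (column names mirror the merged registry data, values are illustrative):

```python
import pandas as pd

# toy yearly totals in the wide layout produced by the merge
totals = pd.DataFrame([{'CAT_2013': 10, 'DOG_2013': 20, 'CAT_2017': 12, 'DOG_2017': 25}])

# melt to long form, then split e.g. 'CAT_2013' into species and year
long = totals.melt(var_name='key', value_name='Sum')
long[['Species', 'Year']] = long['key'].str.rsplit('_', n=1, expand=True)

# pivot back: one row per year, one column per species
table = long.pivot_table(values='Sum', index='Year', columns='Species', aggfunc='sum')
print(table)
```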
Before performing any analysis, we need to verify that each dataset contains the same set of postal codes.
print("The animal registry data contains {} rows and {} columns.".format(df_animals.shape[0],df_animals.shape[1]))
print("The census dataset contains {} rows and {} columns.".format(df_census_raw.shape[0],df_census_raw.shape[1]))
print("The postal dataset contains {} rows and {} columns.".format(df_toronto_postal.shape[0],df_toronto_postal.shape[1]))
We display the ones that do not appear in all three sets.
df_toronto_postal[(~df_toronto_postal['PostalCode'].isin(df_animals['FSA']))]
Note: These are neighborhoods that we removed earlier because they contain fewer than 15 inhabitants. We remove these records from the Toronto postal data.
df_toronto_postal = df_toronto_postal[(df_toronto_postal['PostalCode'].isin(df_animals['FSA']))]
print('After cleaning:')
print("The postal dataset contains {} rows and {} columns.".format(df_toronto_postal.shape[0],df_toronto_postal.shape[1]))
We now have datasets that can be merged together. Before we do, we will add additional features. For instance, we want to add the latitude and longitude to the df_toronto_postal dataframe.
df_zip_toronto = pd.DataFrame(columns = list(df_toronto_postal.columns)+['Latitude','Longitude'])
print('Instantiate new dataframe:')
df_zip_toronto
In order to geolocate the neighborhoods, we use the ARCGIS service instead of geocoder.google. ARCGIS is more reliable and gives accurate results after a single call to the API.
Geocoder Documentation: https://media.readthedocs.org/pdf/geocoder/latest/geocoder.pdf
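Note that the retry loop below will spin indefinitely if the service keeps returning None; a capped-retry variant can be sketched as follows (geocode_with_retry and fake_fetch are hypothetical helpers for illustration, not part of the geocoder API):

```python
import time

def geocode_with_retry(fetch, max_attempts=5, delay=0.0):
    """Call `fetch` until it returns coordinates or attempts run out.

    `fetch` is any zero-argument callable returning [lat, lng] or None,
    e.g. a wrapper around geocoder.arcgis(...).latlng.
    """
    for _ in range(max_attempts):
        coords = fetch()
        if coords is not None:
            return coords
        time.sleep(delay)  # back off before retrying
    return None  # give up instead of looping forever

# usage with a stand-in fetch that fails twice before succeeding
calls = {'n': 0}
def fake_fetch():
    calls['n'] += 1
    return [43.7, -79.4] if calls['n'] >= 3 else None

result = geocode_with_retry(fake_fetch)
print(result, 'after', calls['n'], 'calls')
```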
# set counter of API calls
api_calls = 0
for postalcode, borough, neighborhood in zip(df_toronto_postal['PostalCode'], df_toronto_postal['Borough'], df_toronto_postal['Neighborhood']):
    # initialize the coordinates to None
    lat_lng_coords = None
    # loop until we get the coordinates
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario, Canada'.format(postalcode))
        lat_lng_coords = g.latlng
        api_calls += 1
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    df_zip_toronto = df_zip_toronto.append({
        'PostalCode': postalcode,
        'Borough': borough,
        'Neighborhood': neighborhood,
        'Latitude': latitude,
        'Longitude': longitude
    }, ignore_index=True)
print('All locations have been retrieved.')
print('{} calls to the API were made.'.format(api_calls))
print('New DataFrame populated with the latitude and longitude of each neighborhood:')
df_zip_toronto.head()
print('Animal dataset:')
df_animals.head()
print('Census dataset:')
df_census_raw.head()
Note: Before we merge our datasets, we want to simplify the feature names by changing the name of the dataframe columns.
df_animals.columns = ['PostalCode','CatCount_2013','DogCount_2013','TotalPets_2013','CatCount_2017','DogCount_2017','TotalPets_2017']
df_census_raw.columns = ['PostalCode','Population2016','Dwellings2016']
print('Animal and Census dataset columns have been renamed...')
Note: We now merge the dataframe into one unique set.
df_merged = df_zip_toronto.merge(df_census_raw,on='PostalCode')
df_merged = df_merged.merge(df_animals,on='PostalCode')
print('All three datasets have been merged...')
print("The merged dataset contains {} rows and {} columns.".format(df_merged.shape[0],df_merged.shape[1]))
print('Merged dataset:')
df_merged.head()
In this section, we use the Foursquare API to retrieve the location of veterinarians in the Toronto area. Note that the search is done using the latitude and longitude of each neighborhood; the radius used in the queries is 2000 m for the exploratory query below and 3000 m for the batch retrieval. Even if this returns results that do not strictly belong to the neighborhood, it corresponds to a distance that customers would be willing to travel to visit a veterinarian.
Let's explore the first neighborhood in our dataframe.
Get the neighborhood's latitude and longitude values.
neighborhood_latitude = df_merged.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df_merged.loc[1, 'Longitude'] # neighborhood longitude value
neighborhood_name = df_merged.loc[1, 'Neighborhood'] # neighborhood name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name,
neighborhood_latitude,
neighborhood_longitude))
Now, let's get the veterinarians that are in Highland Creek, Rouge Hill, Port Union within a radius of 2000 meters.
# query parameters: up to 50 venues within a radius of 2000 meters
radius = 2000
LIMIT = 50
VERSION = '20180604'
search_query = 'veterinarian'
url = 'https://api.foursquare.com/v2/venues/search?categoryId=4d954af4a243a5684765b473&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
neighborhood_latitude,
neighborhood_longitude,
VERSION,
radius,
LIMIT)
The following is the JSON file obtained from the query to the Foursquare API.
results = requests.get(url).json()
results
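The nested JSON returned by the API can also be flattened with json_normalize (imported earlier, and exposed as pd.json_normalize in recent pandas). A minimal sketch on a toy payload mirroring the shape of the venue response above (values are illustrative):

```python
import pandas as pd

# toy payload mimicking the Foursquare venue response structure
payload = {'response': {'venues': [
    {'name': 'Sample Vet Clinic',
     'location': {'lat': 43.78, 'lng': -79.16},
     'categories': [{'name': 'Veterinarian'}]},
]}}

# json_normalize flattens nested dicts into dotted column names
df_venues = pd.json_normalize(payload['response']['venues'])
print(df_venues[['name', 'location.lat', 'location.lng']])
```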
print('The category of the venue is {}.'.format(results['response']['venues'][0]['categories'][0]['name']))
# function that extracts the category of the venue
def get_category_type(row):
    """
    Function:
        Retrieve the category type from a venue.
    Input:
        Pandas DataFrame row.
    Output:
        Category name.
    """
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
def getNearbyVenues(postalcodes, names, latitudes, longitudes, radius=3000):
    """
    Function:
        Use the Foursquare API to retrieve the venues associated with each neighborhood.
    Inputs:
        postalcodes: pandas Series containing the postal codes of the neighborhoods.
        names: pandas Series containing the names of the neighborhoods.
        latitudes: pandas Series containing the latitudes of each neighborhood.
        longitudes: pandas Series containing the longitudes of each neighborhood.
        radius: maximum distance in meters between the center of the neighborhood and the venue.
    Output:
        pandas DataFrame containing the set of retrieved venues.
    """
    # initiate an empty list of venues
    venues_list = []
    index = 1
    # loop through the data
    for postal, name, lat, lng in zip(postalcodes, names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?categoryId=4d954af4a243a5684765b473&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            lat,
            lng,
            VERSION,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            postal,
            name,
            lat,
            lng,
            v['name'],
            v['location']['lat'],
            v['location']['lng'],
            v['categories'][0]['name']) for v in results])
        print('{}. {}: {} DONE...{} venues found.'.format(index, postal, name, len(results)))
        index += 1
    # create a DataFrame from the list of venues
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    # set the DataFrame columns
    nearby_venues.columns = ['PostalCode',
                             'Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
We run the query using the Foursquare API to retrieve the veterinarians for each neighborhood.
df_toronto_vets = getNearbyVenues(postalcodes=df_merged['PostalCode'],
names=df_merged['Neighborhood'],
latitudes=df_merged['Latitude'],
longitudes=df_merged['Longitude']
)
Let's look at the distribution of veterinarians.
df_toronto_vets.head(1)
df_toronto_vets_count = df_toronto_vets['PostalCode'].value_counts()
print("The grouped venue dataset contains {} rows.".format(df_toronto_vets_count.shape[0]))
Two neighborhoods do not contain any venues; we identify them and add them manually to the venue count series.
df_merged[~df_merged['PostalCode'].isin(df_toronto_vets_count.index)]
temp = pd.Series({'M1X':0,'M3N':0})
df_toronto_vets_count = df_toronto_vets_count.append(temp)
print('After modifications:')
print("The grouped venue dataset contains {} rows.".format(df_toronto_vets_count.shape[0]))
df_toronto_vets_count.plot(kind='hist',
figsize=(15,8),
title='Histogram of the number of veterinarians per neighborhood',
rwidth=0.95,
bins=28,
xticks=list(range(0,30)),
xlim=(0,29));
In this section, we combine the existing features to create new ones that are judged meaningful for our study. The following features are added:
We first merge the dataframe containing the sum of venues to the main dataframe.
df_toronto_vets_count = df_toronto_vets_count.to_frame().reset_index()
df_toronto_vets_count.columns=['PostalCode','TotalVets']
df_toronto_vets_count.head()
df_merged = df_merged.merge(df_toronto_vets_count, on='PostalCode')
df_merged_final = df_merged.copy()
FINAL DATASET
df_merged_final.head()
Note: We create new features
df_merged_final['CatPerThousandInhab_2017'] = df_merged_final['CatCount_2017'] / df_merged_final['Population2016'] * 1000
df_merged_final['DogPerThousandInhab_2017'] = df_merged_final['DogCount_2017'] / df_merged_final['Population2016'] * 1000
df_merged_final['PetPerThousandInhab_2017'] = df_merged_final['TotalPets_2017'] / df_merged_final['Population2016'] * 1000
df_merged_final['VetPerPet_2017'] = df_merged_final['TotalVets'] / df_merged_final['TotalPets_2017']
df_merged_final['VetPerInhab'] = df_merged_final['TotalVets'] / df_merged_final['Population2016']
df_merged_final['PetPerThousandDwelling_2017'] = df_merged_final['TotalPets_2017'] / df_merged_final['Dwellings2016'] * 1000
df_merged_final['VetPerThousandDwelling'] = df_merged_final['TotalVets'] / df_merged_final['Dwellings2016'] * 1000
df_merged_final['PropCats_2017'] = df_merged_final['CatCount_2017'] / df_merged_final['TotalPets_2017'] * 100
df_merged_final['PropDogs_2017'] = df_merged_final['DogCount_2017'] / df_merged_final['TotalPets_2017'] * 100
df_merged_final['PropCats_2013'] = df_merged_final['CatCount_2013'] / df_merged_final['TotalPets_2013'] * 100
df_merged_final['PropDogs_2013'] = df_merged_final['DogCount_2013'] / df_merged_final['TotalPets_2013'] * 100
df_merged_final['DeltaCats'] = df_merged_final['CatCount_2017'] - df_merged_final['CatCount_2013']
df_merged_final['DeltaDogs'] = df_merged_final['DogCount_2017'] - df_merged_final['DogCount_2013']
df_merged_final['DeltaPets'] = df_merged_final['TotalPets_2017'] - df_merged_final['TotalPets_2013']
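The derived ratios above can be spot-checked on a toy record with round numbers (illustrative values, not actual census or registry figures):

```python
import pandas as pd

# toy record with round numbers so the expected ratios are obvious
row = pd.DataFrame([{'TotalPets_2017': 500, 'Population2016': 10000, 'TotalVets': 2}])

# same formulas as above, applied to the toy record
row['PetPerThousandInhab_2017'] = row['TotalPets_2017'] / row['Population2016'] * 1000
row['VetPerPet_2017'] = row['TotalVets'] / row['TotalPets_2017']

# 500 pets per 10,000 people -> 50 pets per thousand; 2 vets per 500 pets -> 0.004
print(row[['PetPerThousandInhab_2017', 'VetPerPet_2017']])
```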
New features added to final dataset
df_merged_final.head()
In this section, we create several map plots. In order to understand the feature distribution across the various neighborhoods, visualization tools are used.
# url of the GEOJSON file
URL_geojson = 'https://raw.githubusercontent.com/ag2816/Visualizations/master/data/Toronto2.geojson'
# GET request
request = requests.get(URL_geojson)
data = request.text
print('GEOJSON file loaded...')
Using Folium, we plot the population distribution in the Toronto area:
LATITUDE, LONGITUDE = [43.722898, -79.381049]
print('The latitude and longitude of Toronto are {}, {}'.format(LATITUDE,LONGITUDE))
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'Population2016'],
key_on='feature.properties.CFSAUID',
fill_color='YlOrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Total Population'
).add_to(m)
folium.LayerControl().add_to(m)
m
print('Neighborhoods with the largest pet ratio (pets per thousand people):')
df_merged_final[['PostalCode','Borough','Neighborhood','PetPerThousandInhab_2017']].sort_values(by='PetPerThousandInhab_2017',ascending=False).head(3)
print('Neighborhoods with the smallest pet ratio (pets per thousand people):')
df_merged_final[['PostalCode','Borough','Neighborhood','PetPerThousandInhab_2017']].sort_values(by='PetPerThousandInhab_2017',ascending=True).head(3)
Notes:
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'DeltaPets'],
key_on='feature.properties.CFSAUID',
fill_color='BuGn',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Change in number of registered animals between 2013 and 2017'
).add_to(m)
folium.LayerControl().add_to(m)
m
From the above map, we can see that most neighborhoods saw a similar change in the number of registered pets. However, there are a few extreme cases.
f, ax = plt.subplots(figsize=(16,6))
ax.set_title('Boxplot of change in registered pet between 2013 and 2017')
sns.boxplot(data=df_merged_final[['DeltaCats','DeltaDogs','DeltaPets']],ax=ax,orient='h');
The following observations can be made from the above plot:
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'DeltaCats'],
key_on='feature.properties.CFSAUID',
fill_color='BuGn',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Change in number of registered cats between 2013 and 2017'
).add_to(m)
folium.LayerControl().add_to(m)
m
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'DeltaDogs'],
key_on='feature.properties.CFSAUID',
fill_color='BuGn',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Change in number of registered dogs between 2013 and 2017'
).add_to(m)
folium.LayerControl().add_to(m)
m
In this section, we look at the distribution of veterinarians in the city. Let's start by looking at the neighborhoods with the highest and lowest number of veterinarians per inhabitant.
df_merged_final[['PostalCode','Borough','Neighborhood','VetPerInhab']].sort_values(by='VetPerInhab',ascending=True).head(3)
df_merged_final[['PostalCode','Borough','Neighborhood','VetPerInhab']].sort_values(by='VetPerInhab',ascending=False).head(3)
We plot the data to get a better understanding of the distribution of vets throughout the city.
f, ax = plt.subplots(figsize=(16,8))
ax.set_title('Population vs. Vets')
p = sns.regplot(x='Population2016',y='TotalVets',data=df_merged_final,ax=ax);
p.text(60000,25,
'Correlation = {:.3f}'.format(np.corrcoef(df_merged_final['Population2016'],df_merged_final['TotalVets'])[0,1]),
fontsize=15);
f, ax = plt.subplots(figsize=(16,6))
ax.set_title('Boxplot of change in registered pet between 2013 and 2017')
sns.boxplot(data=df_merged_final[['VetPerInhab']],ax=ax,orient='h');
Note: The above plot shows an uneven distribution in the proportion of vets per neighborhood.
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'TotalVets'],
key_on='feature.properties.CFSAUID',
fill_color='Accent',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Number of vets in the proximity of the neighborhood'
).add_to(m)
folium.LayerControl().add_to(m)
m
We now look at the distribution of vets against the pet distribution.
Below are the neighborhoods with the highest and lowest vet to pet ratio.
df_merged_final[['PostalCode','Borough','Neighborhood','VetPerPet_2017']].sort_values(by='VetPerPet_2017',ascending=True).head(3)
df_merged_final[['PostalCode','Borough','Neighborhood','VetPerPet_2017']].sort_values(by='VetPerPet_2017',ascending=False).head(3)
f, ax = plt.subplots(figsize=(16,8))
ax.set_title('Total Registered Pets in 2017 vs. Vets')
p = sns.regplot(x='TotalPets_2017',y='TotalVets',data=df_merged_final,ax=ax,color='dodgerblue');
p.text(1500,25,
'Correlation = {:.3f}'.format(np.corrcoef(df_merged_final['TotalPets_2017'],df_merged_final['TotalVets'])[0,1]),
fontsize=15);
f, ax = plt.subplots(figsize=(16,6))
ax.set_title('Boxplot of vet per pet in 2017')
sns.boxplot(data=df_merged_final[['VetPerPet_2017']],ax=ax,orient='h',color='dodgerblue');
In this section we will investigate the distribution of cats and dogs in Toronto.
First, we look at the neighborhoods where the dogs represent more than 70% of the registered cats and dogs in 2017.
# very dog-friendly neighborhoods
df_merged_final[['PostalCode','Borough','Neighborhood','PropDogs_2017']][df_merged_final['PropDogs_2017']>70].sort_values(by='PropDogs_2017',ascending=False)
# extremely dog-friendly neighborhoods
df_merged_final[['PostalCode','Borough','Neighborhood','PropDogs_2017']][df_merged_final['PropDogs_2017']>80].sort_values(by='PropDogs_2017',ascending=False)
Notes:
f, ax = plt.subplots(figsize=(16,6))
ax.set_title('Proportion of dogs per neighborhood')
sns.boxplot(data=df_merged_final[['PropDogs_2013','PropDogs_2017']],ax=ax,orient='h',color='darkorchid');
f, ax = plt.subplots(figsize=(16,8))
ax.set_title('Proportion of dogs 2013')
p = sns.regplot(x='Population2016',y='PropDogs_2013',data=df_merged_final,ax=ax,color='darkorchid');
p.text(60000,80,
'Correlation = {:.3f}'.format(np.corrcoef(df_merged_final['Population2016'],df_merged_final['PropDogs_2013'])[0,1]),
fontsize=15);
f, ax = plt.subplots(figsize=(16,8))
ax.set_title('Proportion of dogs 2017')
p = sns.regplot(x='Population2016',y='PropDogs_2017',data=df_merged_final,ax=ax,color='darkorchid');
p.text(60000,80,
'Correlation = {:.3f}'.format(np.corrcoef(df_merged_final['Population2016'],df_merged_final['PropDogs_2017'])[0,1]),
fontsize=15);
As shown above, there is no neighborhood in 2017 where there are more registered cats than registered dogs. In addition, it appears that the proportion of dogs has increased between 2013 and 2017.
Moreover, the data shows that the more populated the neighborhood, the lower the proportion of dogs. This can be explained by the fact that dogs require more space, and people tend to own dogs once they live in houses in neighborhoods with a lower population density.
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'PropDogs_2013'],
key_on='feature.properties.CFSAUID',
fill_color='BuPu',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Proportion of dogs in 2013'
).add_to(m)
folium.LayerControl().add_to(m)
m
state_geo = data
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.Choropleth(
geo_data=state_geo,
name='choropleth',
data=df_merged_final,
columns=['PostalCode', 'PropDogs_2017'],
key_on='feature.properties.CFSAUID',
fill_color='BuPu',
fill_opacity=0.8,
line_opacity=0.2,
legend_name='Proportion of dogs in 2017'
).add_to(m)
folium.LayerControl().add_to(m)
m
sns.set(style="white")
# Compute the correlation matrix
omitted_columns = ['Latitude','Longitude','CatCount_2013','DogCount_2013',
'TotalPets_2013','PropCats_2013','PropDogs_2013','PropCats_2017',
'Borough','Dwellings2016','PostalCode']
df_temp = df_merged_final[df_merged_final.columns.difference(omitted_columns)]
corr = df_temp.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(14, 12))
ax.set_title('Correlation between selected features',fontsize=20)
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.00, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5},
annot=True,);
Now that we have a good understanding of the pet and vet distribution in the city, we leverage the power of clustering to find out which neighborhoods are the best candidate for a new veterinarian.
print('Features not included in model:\n')
omitted_columns = ['Latitude','Longitude','CatCount_2013','DogCount_2013',
'TotalPets_2013','PropCats_2013','PropDogs_2013','PropCats_2017',
'Borough','Dwellings2016','PostalCode','Neighborhood']
selected_columns = df_merged_final.columns.difference(omitted_columns)
print(omitted_columns)
Now that the data has been filtered, we copy it into a training set.
X_train = df_merged_final[selected_columns].copy()
print('Training set ready...')
print('Scaling data using standard scaler...')
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train, index=df_merged_final.index, columns=selected_columns)
X_train.head()
print('Data is scaled.')
In order to group similar neighborhoods, we use the K-means clustering algorithm. We iterate over the number of clusters in order to find the best value for this hyperparameter.
# define candidates to n_clusters
num_clusters = range(1, 96)
# instantiate the K-means algorithm
kmeans = [KMeans(n_clusters=i,n_init=20,random_state=100) for i in num_clusters]
# store scores
score = [kmeans[i].fit(X_train).score(X_train) for i in range(len(kmeans))]
# compute first derivative of elbow curve
score_prime = np.array(score)
score_prime = np.diff(score_prime,n=1) / score_prime[0:-1]
# reset plotting style
plt.style.use('ggplot')
# best n_cluster value
best_param = 8
# create plot
f, axes = plt.subplots(nrows=1, ncols=2,figsize=(16, 6))
# elbow curve
axes[0].plot(num_clusters, score)
axes[0].set_xlabel('Number of Clusters')
axes[0].set_ylabel('Score')
axes[0].set_title('Elbow Curve')
axes[0].set_xticks(list(range(0,96,5)))
axes[0].plot([best_param,best_param],[min(score),max(score)],linestyle='dashed')
# first derivative
axes[1].plot(num_clusters[0:-1],score_prime)
axes[1].set_xlabel('Number of Clusters')
axes[1].set_ylabel('First derivative of score')
axes[1].set_title('First derivative of elbow curve')
axes[1].set_xticks(list(range(0,96,5)))
axes[1].plot([best_param,best_param],[min(score_prime),max(score_prime)],linestyle='dashed');
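As a complementary check on the elbow method (not part of the original analysis), the silhouette score can be computed for each candidate number of clusters; it peaks at the value of k that best separates the data. The sketch below runs on synthetic blobs rather than the scaled training set.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [6, 6], [0, 6], [6, 0]],
                  cluster_std=0.6, random_state=100)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=20, random_state=100).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The best k maximizes the silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real `X_train`, agreement between the silhouette peak and the elbow at k=8 would add confidence to the choice of hyperparameter.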
From the above plot, the best number of clusters (using the elbow method) appears to be n_clusters=8.
# set number of clusters
kclusters = 8
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, n_init=20,random_state=100).fit(X_train)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:5]
We now insert the results of the clustering into our original dataframe.
df_merged_final['Cluster'] = kmeans.labels_
print('Overview of the clustering:\n')
df_merged_final.head()
We create a map to visualize the different clusters.
step = folium.StepColormap(['r','cyan','lime','brown',
'darkorchid','royalblue','pink','gold','gray'],vmin=0.0, vmax=8)
print('Colormap used to differentiate clusters:')
step
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.GeoJson(
state_geo,
style_function=lambda feature: {
'fillColor': step(df_merged_final.set_index('PostalCode')['Cluster'][feature['properties']['CFSAUID']]),
'color' : 'black',
'weight' : 0.5,
'fillOpacity':0.6
}
).add_to(m)
# add markers to the map
for lat, lon, poi, cluster in zip(df_merged_final['Latitude'], df_merged_final['Longitude'], df_merged_final['Neighborhood'], df_merged_final['Cluster']):
label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=2,
popup=label,
color='k',
fill=True,
fill_color='k',
fill_opacity=1.0).add_to(m)
folium.LayerControl().add_to(m)
m
Let's now print the average features per cluster.
df_cluster_grouped = df_merged_final.groupby('Cluster').mean()
df_cluster_grouped = df_cluster_grouped.reset_index()
df_cluster_grouped
colors = ['red','cyan','lime','brown',
'darkorchid','royalblue','pink','gold','gray']
plt.style.use('ggplot')
f, ax = plt.subplots(figsize=(16,16))
for g in np.unique(df_merged_final['Cluster']):
ax.scatter(x = df_merged_final['TotalVets'][df_merged_final['Cluster']==g],
y = df_merged_final['TotalPets_2017'][df_merged_final['Cluster']==g],
marker = 'o',
linewidth = 0,
s = df_merged_final['Population2016'][df_merged_final['Cluster']==g]/40,
c = colors[g],
label = 'Cluster :' + str(g),
alpha = 0.6
)
ax.set_xlabel('Number of Veterinarians')
ax.set_ylabel('Number of Pets')
legend = ax.legend(frameon=True)
for legend_handle in legend.legendHandles:
legend_handle._sizes = [50]
colors = ['red','cyan','lime','brown',
'darkorchid','royalblue','pink','gold','gray']
plt.style.use('ggplot')
f, ax = plt.subplots(figsize=(16,16))
for g in np.unique(df_merged_final['Cluster']):
ax.scatter(x = df_merged_final['PetPerThousandInhab_2017'][df_merged_final['Cluster']==g],
y = df_merged_final['VetPerPet_2017'][df_merged_final['Cluster']==g],
marker = 'o',
linewidth = 0,
s = df_merged_final['Population2016'][df_merged_final['Cluster']==g]/40,
c = colors[g],
label = 'Cluster ' + str(g),
alpha = 0.6
)
ax.set_xlabel('Pet Per Thousand Inhabitants 2017')
ax.set_ylabel('Veterinarians Per Pet 2017')
ax.set_yscale("log")
ax.set_ylim([10**-4,10**0])
legend = ax.legend(frameon=True)
for legend_handle in legend.legendHandles:
legend_handle._sizes = [50]
step = folium.StepColormap(['gray','cyan'],vmin=0.0, vmax=1.0)
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.GeoJson(
state_geo,
style_function=lambda feature: {
'fillColor': step(df_merged_final.set_index('PostalCode')['Cluster'][feature['properties']['CFSAUID']]==1),
'color' : 'black',
'weight' : 0.5,
'fillOpacity':0.6
}
).add_to(m)
folium.LayerControl().add_to(m)
m
We now look at the trend in the number of registered animals for cluster 1.
df_cluster_1 = df_merged_final[(df_merged_final['Cluster']==1)]
df_cluster_1.head(1)
df_cluster_1[['TotalPets_2013','TotalPets_2017']].set_index(df_cluster_1['PostalCode']).T
f, ax = plt.subplots(figsize=(10,10))
df_cluster_1[['TotalPets_2013','TotalPets_2017']].set_index(df_cluster_1['PostalCode']).T.plot(ax=ax)
ax.set_title('Change in number of registered pets for the neighborhoods of cluster 1')
ax.set_xlabel('Year')
ax.set_ylabel('Number of registered pets');
df_cluster_1.set_index(df_cluster_1['PostalCode'])['DeltaPets'].sort_values(ascending=False)
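The `DeltaPets` column is not defined in this section; presumably it measures the change in registered pets between the two census years. A minimal sketch of that assumed definition, `TotalPets_2017 - TotalPets_2013`, on toy data:

```python
import pandas as pd

# Toy frame with the two pet-count columns used above;
# the values are illustrative, not real census figures.
df = pd.DataFrame({
    'PostalCode': ['M9A', 'M4C', 'M1B'],
    'TotalPets_2013': [1200, 900, 1500],
    'TotalPets_2017': [1650, 950, 1400],
})

# Assumed definition: growth in registered pets between 2013 and 2017.
df['DeltaPets'] = df['TotalPets_2017'] - df['TotalPets_2013']
print(df.set_index('PostalCode')['DeltaPets'].sort_values(ascending=False))
```

Sorting descending, as above, puts the fastest-growing neighborhood at the top, which is how the candidate postal code is identified.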
CONCLUSION:
With the available data, the neighborhood M9A appears to represent the best opportunity for a new veterinary clinic.
m = folium.Map(location=[LATITUDE,LONGITUDE], zoom_start=11)
folium.GeoJson(
state_geo,
style_function=lambda feature: {
'fillColor': step(feature['properties']['CFSAUID']=='M9A'),
'color' : 'black',
'weight' : 0.5,
'fillOpacity':0.6
}
).add_to(m)
folium.LayerControl().add_to(m)
m